The lexer hack

When parsing computer programming languages, the lexer hack (as opposed to "a lexer hack") describes a common solution to the problems which arise when attempting to use a regular grammar-based lexer to classify tokens in ANSI C as either variable names or type names.

Contents

Problem

In a compiler, the lexer performs one of the earliest stages of converting the source code to a program. It scans the text to extract meaningful tokens, such as words, numbers, and strings. The parser analyzes sequences of tokens attempting to match them to syntax rules representing language structures, such as loops and variable declarations. A problem occurs here if a single sequence of tokens can ambiguously match more than one syntax rule.

This ambiguity can happen in C if the lexer does not distinguish between variable and typedef identifiers.[1] For example, in the C expression:

(A) * B

the lexer may find these tokens:

  1. left parenthesis
  2. identifier 'A'
  3. right parenthesis
  4. operator '*'
  5. identifier 'B'

The parser can interpret this as variable A multiplied by B or as type A casting the dereferenced value of B. This is known as the "typedef-name: identifier" problem.[2]

The hack solution

The solution generally consists of feeding information from the parser's symbol table back into the lexer. This incestuous mixing of the lexer and parser is generally regarded as inelegant, which is why it is called a "hack". The lexer cannot distinguish type identifiers from other identifiers without extra context because all identifiers have the same format.

With the hack in the above example, when the lexer finds the identifier A it should be able to classify the token as a type identifier. The rules of the language would be clarified by specifying that typecasts require a type identifier and the ambiguity disappears.

The problem also exists in C++ and parsers can use the same hack.[1]

Alternative solutions

This problem does not arise (and hence needs no "hack" in order to solve) when using lexerless parsing techniques.

Some parser generators, such as the yacc-derived BtYacc ("Backtracking Yacc"), give the generated parser the ability to try multiple attempts to parse the tokens. In the problem described here, if an attempt fails because of semantic information about the identifier, it can backtrack and attempt other rules.[3]

References

Citations